This is based on the data file called HW1_asset_prices.csv. This represents the price movements of a set of assets (bonds, stocks etc., their description is quite irrelevant here). Economists and investors are very interested in the correlation of asset prices, both to understand risk, as well as (hopefully) find correlation to lagged asset prices for investing. A correlation matrix with N assets is an N × N matrix of correlations. See https://en.wikipedia.org/wiki/Stock_correlation_network for some background information.
Your task is to visualize the correlation matrix in network form. You are free to use any Python package for your calculations in part a (numpy, scipy, pandas, etc.) as you see fit. For b and c, I expect you to use the matplotlib.pyplot interface in NetworkX. You can gain the +5 points if you (i) submit in Jupyter notebook format (ii) interface to an external drawing package such as graphviz to call one of its drawing functions.
Get a skeleton code working to do the tasks (can be done in under 10 lines of code with the right built-in functions in pandas and networkx) and then enhance and explore from there on.
a. (10 points) Calculate the correlation matrix (you may have to use your Stats and Econometrics knowledge here)
b. (10 points) With assets as nodes, plot the matrix as a network. Experiment with the different graph layouts and choose the one that you believe is best, and explain why
c. (10 points) Enhance your plot by representing the thickness of the edges and size of the nodes to give some insight.
a.
We first import the necessary libraries. We also use the command "%matplotlib inline" so that our graphs are displayed inside the Jupyter notebook.
# we first import the necessary libraries
%matplotlib inline
import numpy as np
import pandas as pd
import networkx as nx
from graphviz import Graph
import matplotlib.pyplot as plt
We load our data set.
# we load the csv file
data = pd.read_csv("HW1_asset_prices.csv", sep = ',')
# we see the first 5 rows of our data set
data.head()
We drop the first column:
# we drop the first column
df = data.drop('Date', axis = 1)
df.head()
Now we calculate the correlation matrix as follows:
# we calculate the correlation matrix
corr_matrix = df.corr()
corr_matrix_initial = corr_matrix.copy() # keep an independent copy of the correlation matrix for later questions
corr_matrix_initial
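As a quick sanity check on what `DataFrame.corr()` returns, the sketch below uses a small made-up price table (the column names and values are illustrative and are not taken from HW1_asset_prices.csv). By default `corr()` computes pairwise Pearson correlations, so the result should be symmetric with ones on the diagonal.

```python
import numpy as np
import pandas as pd

# toy prices for three hypothetical assets (not the homework data)
toy = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],   # moves exactly with A
    "C": [4.0, 3.0, 2.0, 1.0],   # moves exactly against A
})

toy_corr = toy.corr()  # Pearson correlation is the default method

# properties any correlation matrix should satisfy
is_symmetric = np.allclose(toy_corr.values, toy_corr.values.T)
has_unit_diagonal = np.allclose(np.diag(toy_corr.values), 1.0)
```

The same two checks can be run on `corr_matrix_initial` before it is turned into a network.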
b.
We now copy our correlation matrix to a txt file so as to read it later as we did in the tutorial:
corr_matrix_initial.to_csv('output.txt', index = True, header = True, sep = ' ')
In order to construct the edgelist file as in the tutorial we edit the correlation matrix so as to have all possible edge combinations with their correlation as follows:
# reshape the corr_matrix so as to have all possible edge combinations with their correlations
# include only values whose absolute correlation is greater than or equal to 0.01
# convert corr_matrix to columns
corr_matrix = corr_matrix[abs(corr_matrix) >= 0.01].stack().reset_index(name = 'correlations')
# remove diagonal line of matrix
corr_matrix = corr_matrix[corr_matrix['level_0'] != corr_matrix['level_1']]
# concatenate the two columns after sorting
corr_matrix['columns'] = corr_matrix.apply(lambda row: '-'.join(sorted([row['level_0'], row['level_1']])), axis = 1)
# drop duplicates and concatenated columns
corr_matrix = corr_matrix.drop_duplicates(['columns'])
corr_matrix.drop(['columns'], inplace = True, axis = 1)
# final edgelist file
edgelist = corr_matrix
edgelist
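The same kind of edge list can be produced without the sort-and-deduplicate step by keeping only the part of the matrix above the diagonal. The following is a minimal sketch of that alternative on a made-up 3 × 3 matrix; `toy_corr`, `upper`, `toy_edges` and the column names `source` and `target` are illustrative names, and the 0.01 threshold mirrors the one used above.

```python
import numpy as np
import pandas as pd

# made-up symmetric correlation matrix for three hypothetical assets
toy_corr = pd.DataFrame(
    [[1.0, 0.8, -0.3],
     [0.8, 1.0, 0.005],
     [-0.3, 0.005, 1.0]],
    index=["A", "B", "C"], columns=["A", "B", "C"],
)

# boolean mask that is True strictly above the diagonal
upper = np.triu(np.ones(toy_corr.shape, dtype=bool), k=1)

toy_edges = (
    toy_corr.where(upper)   # NaN on and below the diagonal
    .stack()                # long format: one row per (row label, column label)
    .dropna()               # drop masked entries explicitly
    .reset_index()
)
toy_edges.columns = ["source", "target", "correlations"]

# keep only pairs whose absolute correlation is at least 0.01
toy_edges = toy_edges[toy_edges["correlations"].abs() >= 0.01]
```

Because the mask already removes the diagonal and one copy of each symmetric pair, every remaining row is a distinct undirected edge.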
We now write the edge list to a txt file so as to read it back as a graph, as we did in the tutorial:
edgelist.to_csv('edjelist.txt', index = False, header = False, sep = ' ')
We are now ready to create the graph.
# Create an empty graph structure with no nodes and no edges.
G = nx.Graph() # this is the NetworkX Graph class, not graphviz.Graph
# Read an un-directed graph from a list of edges
G = nx.read_edgelist("edjelist.txt", nodetype = str, data = [('weight', float)])
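Writing the edge list to disk and reading it back follows the tutorial, but NetworkX can also build the graph straight from a DataFrame with `nx.from_pandas_edgelist`. Below is a minimal sketch, assuming the edge table has the three columns produced above (`level_0`, `level_1`, `correlations`); the rows and the names `toy_edgelist` and `G_direct` are made up for illustration.

```python
import networkx as nx
import pandas as pd

# toy edge table with the same column layout as the edge list above
toy_edgelist = pd.DataFrame({
    "level_0": ["A", "A", "B"],
    "level_1": ["B", "C", "C"],
    "correlations": [0.8, -0.3, 0.5],
})

# rename so the edge attribute is called 'weight', as in read_edgelist above
G_direct = nx.from_pandas_edgelist(
    toy_edgelist.rename(columns={"correlations": "weight"}),
    source="level_0",
    target="level_1",
    edge_attr="weight",
)
```

This avoids the intermediate text file, at the cost of departing from the tutorial's workflow.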
We now plot the Graph as indicated below and we experiment with several layouts:
plt.subplots(figsize = (20,20))
plt.title("Graph 1 - Circular Layout", fontsize = 20)
nx.draw(G,pos = nx.circular_layout(G), with_labels = True, node_color = 'red', edge_color = 'blue', font_size = 12)
plt.subplots(figsize = (20,20))
plt.title("Graph 1 - Random Layout", fontsize = 20)
nx.draw(G,pos = nx.random_layout(G), with_labels = True, node_color = 'red', edge_color = 'blue', font_size = 12)
plt.subplots(figsize = (20,20))
plt.title("Graph 1 - Spring Layout", fontsize = 20)
nx.draw(G,pos = nx.spring_layout(G), with_labels = True, node_color = 'red', edge_color = 'blue', font_size = 12)
plt.subplots(figsize = (20,20))
plt.title("Graph 1 - Spectral Layout", fontsize = 20)
nx.draw(G,pos = nx.spectral_layout(G), with_labels = True, node_color = 'red', edge_color = 'blue', font_size = 12)
We experimented with the different graph layouts and chose the circular one (similar to shell). Every asset has a significant correlation with all the others, so every node is connected to every other node. The circular layout shows this clearly: the nodes are evenly spaced on a circle and each edge is drawn as a straight chord between two of them.
With the selected layout, colours and sizes, the network is easier to read and more visually appealing.
We also tried the random, spring and spectral layouts. We rejected even the best of them, the random layout, because its nodes are placed at random positions and are not clearly visible.
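To compare the layouts without drawing anything, it helps to remember that each layout function simply returns a dictionary mapping a node to its (x, y) position. The sketch below is an illustration on a small complete graph (a stand-in for our fully connected correlation network, not the homework data); the names `toy_graph`, `layouts` and `radii` are hypothetical.

```python
import networkx as nx
import numpy as np

# small complete graph: every pair of nodes is connected
toy_graph = nx.complete_graph(6)

# each layout returns {node: array([x, y])}
layouts = {
    "circular": nx.circular_layout(toy_graph),
    "random": nx.random_layout(toy_graph, seed=0),
    "spring": nx.spring_layout(toy_graph, seed=0),
    "spectral": nx.spectral_layout(toy_graph),
}

# in the circular layout all nodes lie at the same distance from the centre
radii = [float(np.linalg.norm(xy)) for xy in layouts["circular"].values()]
```

Any of these dictionaries can be passed to `nx.draw` through its `pos` argument; fixing `seed` makes the random and spring layouts reproducible between runs.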
c.
We are now ready to enhance our plot. We first create a second graph as follows:
# Create an empty graph structure with no nodes and no edges.
G2 = nx.Graph() # this is the NetworkX Graph class, not graphviz.Graph
# Read an un-directed graph from a list of edges
G2 = nx.read_edgelist("edjelist.txt", nodetype = str, data = [('weight', float)])
We now store our weights in order to use them for the thickness of the edges and the size of our nodes.
# node size: sum of the absolute correlations of each asset, aligned to the node order of G2
weights = abs(corr_matrix_initial).sum(axis = 0).reindex(list(G2.nodes)).to_list()
weights = list(np.asarray(weights) * 20) # we multiply by 20 so that the nodes are more visible
# we sum absolute values because very high positive and very high negative correlations
# would otherwise cancel out and the node would appear to have almost zero correlation
weights2 = list(nx.get_edge_attributes(G2, 'weight').values())
weights2 = list(np.abs(np.asarray(weights2))) # use |correlation| so negative correlations also get a visible thickness
We now plot the Graph as indicated below:
plt.subplots(figsize = (20,20))
plt.title("Graph 2", fontsize = 20)
nx.draw(G2, pos = nx.circular_layout(G2), node_size = weights, width = weights2, with_labels = True, node_color = 'red', edge_color = 'blue', font_size = 12)
To sum up, we visualised the different weights through edge thickness and node size.
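The mapping from correlations to visual attributes can also be computed from the graph itself, which guarantees that the sizes and widths follow the same node and edge order that `nx.draw` uses. This is a sketch on a made-up three-node graph; the scale factors 300 and 5 and the names `toy_graph`, `node_sizes` and `edge_widths` are arbitrary choices for illustration. Unlike the matrix-based sum above, this version does not count the diagonal entry of 1.

```python
import networkx as nx

# toy weighted graph; weights play the role of correlations
toy_graph = nx.Graph()
toy_graph.add_weighted_edges_from([("A", "B", 0.8), ("A", "C", -0.3), ("B", "C", 0.5)])

# node size: sum of |weight| over the edges touching each node, in graph node order
node_sizes = [
    300 * sum(abs(d["weight"]) for _, _, d in toy_graph.edges(n, data=True))
    for n in toy_graph.nodes
]

# edge width: |weight| of each edge, in graph edge order
edge_widths = [5 * abs(w) for _, _, w in toy_graph.edges(data="weight")]
```

The two lists can then be passed as `node_size=node_sizes` and `width=edge_widths` in the `nx.draw` call.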
In the following commands we use the external graphviz drawing package for our second graph and call one of its drawing functions.
graphviz_example = Graph(name = 'Graphviz Example', format = 'svg', strict = True)
node_list = list(nx.nodes(G2))
# we restrict the drawing to 9 nodes, since the full graph
# would be too large for effective visualisation
for j in range(0, 9):
    for i in range(0, 9):
        if node_list[j] != node_list[i]:
            graphviz_example.edge(node_list[j], node_list[i])
graphviz_example
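The double loop above visits every pair of nodes twice, once in each direction, and relies on `strict = True` to merge the duplicates. As a possible alternative, `itertools.combinations` yields each unordered pair exactly once. The sketch below only builds the list of pairs, using hypothetical node names rather than the homework assets.

```python
from itertools import combinations

# hypothetical node names standing in for the first 9 assets
toy_nodes = [f"asset_{k}" for k in range(9)]

# every unordered pair of distinct nodes, each appearing exactly once
toy_pairs = list(combinations(toy_nodes, 2))
```

Each pair in `toy_pairs` could then be passed to `graphviz_example.edge(a, b)` in a single loop.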